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Abstract —Gene regulatory network reconstruction is a funda¬ 
mental problem in computational biology. We recently developed 
an algorithm, called PANDA (Passing Attributes Between Net¬ 
works for Data Assimilation), that integrates multiple sources 
of ’omics data and estimates regulatory network models. This 
approach was initially implemented in the C++ programming 
language and has since been applied to a number of biological 
systems. In our current research we are beginning to expand 
the algorithm to incorporate larger and most diverse data¬ 
sets, to reconstruct networks that contain increasing numbers of 
elements, and to build not only single network models, but sets of 
networks. In order to accomplish these “Big Data” applications, it 
has become critical that we increase the computational efficiency 
of the PANDA implementation. In this paper we show how to 
recast PANDA’s similarity equations as matrix operations. This 
allows us to Implement a highly readable version of the algorithm 
using the MATLAB/Octave programming language. We find that 
the resulting M-code much shorter (103 compared to 1128 lines) 
and more easily modifiable for potential future applications. The 
new Implementation also runs significantly faster, with increasing 
efficiency as the network models increase in size. Tests comparing 
the C-code and M-code versions of PANDA demonstrate that 
this speed-up is on the order of 20-80 times faster for networks 
of similar dimensions to those we find in current biological 
applications. 

I. Introduction 

Rapidly evolving genomic technologies are providing bio¬ 
logically informative data of unprecedented volume, velocity, 
and variety, with the potential to yield new insights into the 
processes driving disease. This data has allowed us to develop 
a more unified understanding of how many different types of 
interactions at multiple and vastly different scales can influ¬ 
ence biological systems. We now appreciate that changes in 
cellular states involve simultaneous alterations to the genome, 
epigenome, transcriptome, metabolome, and proteome of the 
cell. These are often characterized by complex networks whose 
structures are altered as the phenotype changes. The generation 
of Big Data in this area now presents an unprecedented 
opportunity to develop scalable methods capable of modeling 
these networks, informing us about the nature of disease and 
ultimately allowing us to hypothesize about the therapeutic 
approaches most appropriate for each disease state. 

Working in this area, we recently developed and published a 
method, called PANDA (Passing Attributes between Networks 
for Data Assimilation), that uses a “message passing” approach 


to integrate multiple types of genomic data and construct di¬ 
rected genome-wide regulatory networks HI. PANDA models 
network interactions as communication between “transmitters” 
and “receivers”, referred to in the original publication as “ef¬ 
fector” and “affected” nodes. This approach recognizes that for 
communication to occur, both the transmitter and the receiver 
have an essential role. By constructing a “prior” regulatory 
network consisting of potential routes for communication 
and integrating with other sources of information, PANDA 
estimates the responsibility and availability of each potential 
interaction, predicts where communication is succeeding or 
failing, and deduces condition-specific network structures. 
While many methods exist for inferring relationships in a gene 
regulatory network El, 0, PANDA represents the first method 
that incorporates multiple biological data types naturally, by 
comparing them in order to emphasize common elements. 
Most importantly, this approach provides a unified modeling 
framework for integrating multiple and heterogeneous biolog¬ 
ical data types. 

We have now applied PANDA to explore the effects of 
smoking in knockout mouse models a, to search for potential 
drug candidates in ovarian cancer a and to build sex-specific 
regulatory networks in chronic obstructive pulmonary disease 
(coPD) a. In each of these applications we compared pairs 
or sets of networks reconstructed using PANDA to uncover 
regulatory mechanisms that would not have been identified 
using gene-based approaches. The importance of comparing 
network states highlights the need for a computationally 
efficient implementation of the PANDA algorithm. Indeed, 
much of our ongoing and future research now revolves around 
computing and comparing multiple versions and various sets 
of these akeady-large network models. 

The PANDA algorithm for network reconstruction was 
originally implemented in the C++ programming language. In 
order to increase code readability (to facilitate future modifi¬ 
cation and data integration) and to enable greater scaling of 
the algorithm, which will be necessary in order to incorporate 
larger and increasingly diverse data-sets, we have implemented 
a version of the algorithm in the MATLAB/Octave program¬ 
ming language. We find that this M-code implementation of 
PANDA runs significantly faster than the C++ version across 
a range of network sizes. The M-code is also much shorter 



Fig. 1. The PANDA network reconstruction algorithm models two types of 
nodes: “effectors” (circles) and “affected” (squares). It also considers three 
types of networks: “cooperativity” (between pairs of “effector” nodes, brown 
lines), “regulatory” (from “effector” to “affected” nodes, red lines), and “co- 
regulatory” (between pairs of “affected” nodes). 

(103 versus 1128 lines). In this paper we outline the approach 
we took in re-coding the algorithm. Namely, we begin by 
reviewing the mathematical framework used by PANDA. We 
then show how PANDA’s similarity equation can be re-written 
as products of matrices, enabling us to take advantage of 
MATLAB’s extreme optimization of these operations. Finally, 
we systematically compare the computational efficiency of the 
previous and new implementations of the algorithm using a 
range of input data sizes, including those we often encounter 
in real-world biological applications. 

II. Approach 

A. Finding Agreement between Networks using PANDA 

Transcriptional regulation depends on a complex relation¬ 
ship between transcription factors (TFs) and their downstream 
targets. To model this process, we developed PANDA, a 
“message-passing” algorithm for reconstruction of gene reg¬ 
ulatory networks. PANDA considers two types of network 
nodes, “effectors” and “affected”, and models three types of 
network relationships between these nodes (see Figure [^. In 
the context of biological regulatory networks, transcription 
factors can be viewed as effector nodes that regulate (affect) 
the behavior of their downstream target genes. 

An overview of the PANDA algorithm is presented in Figure 

To begin, PANDA takes input information pertaining to the 
relationships between effector and affected nodes and con¬ 
structs three “seed” networks. A symmetric network between 
effector nodes (of dimensions A’e x Nf., where A), is the 
number of effector nodes) is referred to as the “cooperativity” 
network (P). Similarly, a symmetric network between affected 
nodes (of dimensions Na x Na, where Na is the number of af¬ 
fected nodes) is referred to as the “co-regulatory” network (C). 
Finally, a non-symmetric network from effector to affected 
nodes (of dimensions x Na) is called the “regulatory” 
network {W). 

In the context of transcriptional data, PANDA reads in 
expression information and estimates an initial co-regulatory 
network by calculating the Pearson correlation between pairs 


of genes across all the samples in the data. The cooperativity 
network is initially defined based on a user-provided set of in¬ 
teractions between pairs of transcription factors. To address the 
incompleteness of biological data, in the absence of expression 
or protein interaction information PANDA will initialize these 
two networks as identity matrices. Finally, to create the initial 
regulatory network, PANDA reads in predicted transcription 
factor-gene relationships, which are often estimated by us¬ 
ing DNA sequence information to create a edge between a 
transcription factor (effector node) and a target gene (affected 
node) if a known binding “motif” JT) for that transcription 
factor exists in the promoter region of the gene. In order 
to effectively combine information from these diverse data 
sources, PANDA performs a Z-score normalization on each 
of the networks such that the edge-weights contained in their 
corresponding matrices all are in the same unit-space and of 
similar distribution. 

After reading in and normalizing the input data, PANDA 
performs a message-passing procedure to slowly integrate 
the information contained in the three initial networks. One 
important aspect of this approach is its emphasis on agreement 
among network neighborhoods rather than direct targeting 
information. For example, unlike many other network re¬ 
construction approaches 111, 13, PANDA does not infer a 
relationship between a transcription factor and a gene directly 
from that pair’s co-expression; instead, an edge is inferred if 
the gene is co-expressed with other targets of the transcription 
factor. By performing iterative “soft” updates on each of 
the networks, the models slowly accumulate evidence for 
interactions that is shared across all of the input data sources, 
eventually moving to consensus networks that better explain 
the full set of observations. This method thus serves to infer 
comprehensive new biology that would not be obvious based 
on any single data-type alone. 

Supporting this idea, at the heart of PANDA is an equation 
that evaluates the similarity between sets of interactions in two 
networks: 

T^y = T{x,y) 

_ x-y 

~ VPP + ||yP-|f.y-| (1) 
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This equation is used repeatedly throughout the PANDA 
message-passing procedure. Below we detail how this equation 
has been and can be implemented to best take advantage of 
the strengths of different coding languages. We show that 
changing how we formulate this and the other mathematics 
in PANDA can have a dramatic impact in the implementation 
and can drastically speed up algorithm. 

B. The PANDA Message-Passing Algorithm 

The message-passing framework of PANDA is designed to 
slowly merge information from different network structures, 
initially derived from an underlying set of biological data. 
In order to do this, PANDA iteratively updates each of three 



























results and exit 


+ In the absence of input data , default initialization is as an identity matrix 
t the update of diagonal elements follow a slightly different equation, see text. 

Fig. 2. Overview of the steps included in the PANDA network reconstruction 
algorithm. A significant portion of the algorithm involves a message-passing 
procedure in which information from the initial data is slowly merged together. 


initial networks. Namely, at each iteration, t, PANDA first 
updates the regulatory network based on the responsibility (R) 
and availability (A) of edges: 

= (1 - + |(Ag + R^) (2) 

The responsibility of an edge between TF i and gene j in the 
regulatory network is calculated based the previous level of 
agreement between the set of transcription factors that target 
gene j and those that cooperate with transcription 

factor i 

= (3) 

Similarly, the availability of each edge is based on the level 
of agreement between the regulatory targets of transcription 
factor i and the set of genes with which gene j is 

co-regulated 

( 4 ) 


Next PANDA updates the protein cooperativity and gene co- 
regulatory networks by comparing the sets of genes regulated 
by a pair of TFs, and the sets of TFs targeting a pair of genes, 
respectively. Namely: 


P^ra = (1 - 

eg, = (1 - a)cg-g + ar([ivi‘g, wf) 

To satisfy convergence criteria, the diagonal elements of P 
and C are updated based on the off-diagonal elements of P 
and C, respectively. More specihcally: 

where ai is the standard deviation across the non-diagonal 
elements in row i of P and cr, is the standard deviation 

across the non-diagonal elements in row j of C 

Finally, at each iteration, when computing PANDA 

also calculates the hamming distance between the previous 
and predicted regulatory network: 


The updates (of W, then P and C) are iteratively repeated 
until convergence, which is dehned as h < 10“^. Depending 
on the update parameter, a, this can involve anywhere from 
ten to several hundred iterations of the message-passing pro¬ 
cedure; the current suggested default value for a is 0.1 and 
involves approximately 40 iterations. The result of the PANDA 
message-passing approach is a value associated with each of 
the edges in all three networks that reflects shared structure 
across the original set of input data. 


C. Implementing PANDA using Matrix Algebra 

T (Equation is used to calculate the values for each 
edge in A, R, C and P (see Equations [3]|^. As it is written. 
Equation [T] returns a single number, namely, a score for 
an edge from node i to node j in one network, based on 
vectors containing values from the row and column 
of matrix representations for two other networks. In the 
C-H- implementation, this is calculated via summations within 
nested for-loops. Below we show an example of the general 
framework for performing this operation, based on predicting 
the availability (A, equation for all the edges. Note that 
this is not exactly replicated from the C-H- code (available at 
http://sourceforge.net/projects/panda-net/), but does accurately 


































represent the way the similarity calculation is implemented; 

for{i = 0]i < Ne; i + +){ 
for{j = 0;j < Na\j + +){ 

Avar = 0; Bvar = 0; Cvar = 0; 
for{k = 0;k < Ne\ k + +){ 

Avar+ = C[j].tar[k] * 14^[i].tar[fc]; 

Bvar+ = C[j].tar[k] * C[j].tar[k]; 

Cvar+ = iy[i].tar[fc] * VF[*].tar[/c]; 

} 

A[z].iar[j] = Avar / sqrt{Bvar + Cvar — fabs{Avar)) 

} 

} 

( 8 ) 

Here we point out that instead of solving for each element 
of the T matrix individually, as it done in the code-snippet 
shown above, one can alternately solve for the entire matrix of 
T-values. Using the MATLAB/Octave programming language, 
this can be written as a series of matrix operations contained 
in the following 5 lines of code; 

function Amat = Tfunction{X,Y)] 

Amat = {X *Y)] 

Bmat = repmat{sum{YA2, 1), size{X^ 1), 1); (9) 

Cmat = repmat{sum{XA2, 2), 1, size{Y, 2)); 

Amat = Amat. / sqrt{B mat + Cmat — abs{Amat))', 

This matrix-oriented implementation requires the simultaneous 
storage of three potentially very large matrices {Amat, Bmat 
and Cmat) in addition to the original given networks (X and 
Y). On the other hand, the C-n- implementation shown above, 
by estimating each element of T separately, only requires 
storing one of these matrices (A). Thus by solving for all the 
elements of the matrix at once we have increased the memory 
requirements for this process. 

We have evaluated how to represent all of PANDA’s mathe¬ 
matics, including the calculation of T, as matrix operations 
and re-implemented the algorithm in the MATLAB/Octave 
programming language (hereafter referred to as the M-code). 
This PANDA implementation includes five hies; 

1) RimPANDA.m: The master M-hle that reads in the input 
data, constructs the initial networks, runs the message¬ 
passing algorithm and prints the hnal network estimates 
to a hie. This code calls two functions contained in other 
hies; NormalizeNetworkO and PANDA(). 

2) NonnalizeNetwork.m: Performs a Z-score transforma¬ 
tion on a given input network. 

3) PANDA.m: Given set of three networks; W, C and P, 
performs the core message-passing procedure. This code 
calls two functions contained in other hies; Tfunction() 
and UpdateDiagonal(). 

4) Tfunction.m: Computes the similarity between two net¬ 
works (Equation [^1. 


5) UpdateDiagonal.m: Updates the diagonal elements of a 
given symmetric network (Equation |^. 

The M-code can be downloaded at; https;//sites.google.com/a/ 
channing.harvard.edu/kimberlyglass/tools/panda 

D. Comparing the M-code and C-code PANDA Implementa¬ 
tions 

We have used cloc to evaluate the number of lines of code 
in this new M-code implementation and compared that to the 
number of lines of code in the original C-code implementation 
of PANDA. We hnd that the M-code version of PANDA is 
approximately ten-times shorter (103 lines) than the the C- 
code (1128 lines). A small portion of this reduction in line- 
count can be attributed to additional features found in the C- 
code that we did not fully implement in the M-code. However, 
it is unlikely that adding these to the M-code would increase 
the line-count signihcantly. Eor example, the C-code includes 
a parameter that allows the user to randomize node-labels upon 
input; such an addition to the M-code would require only a 
few lines. In addition to being shorter, it is also important to 
emphasize that the new M-code implementation of PANDA 
is also highly human-readable, especially in comparison to 
the C-code. Thus integrating any “missing” features into the 
M-code, such as this randomization procedure, will be less 
cumbersome than the future introduction of other new features 
into the C-code. 

In addition to code-length, we have also tested the speed 
of the C-code and M-code implementations of PANDA. In 
order to do this, in the C-code we added a call to the 
clock() function immediately prior to and immediately after 
the message-passing procedure. Time in seconds was then 
determined by multiplying the difference in these calls by 
CLOCKS_PER_SEC. Eor the M-code, the run-time of 
the message-passing procedure was determined by the tic/toc 
function, again placed immediately prior to and past the 
message-passing procedure. By placing clock() and tic/toc 
around the message-passing procedure, we are not addressing 
any single time-costs that are associated with reading in or 
parsing the original data, or in outputting the results. 

Eor this speed-test we constructed one hundred random 
initial regulatory networks {W) at each of several various sizes 
(We = Na^ {125,250,500,1000,2000}). Eor the other two 
networks (P and C) we took advantage of PANDA’s default 
behavior wherein those are initialized as identity matrices in 
the lack of external information (see Eigure |^. We compiled 
the C-code using g-n- with an optimization flag (—03). We 
ran the M-code using both GNU Octave (version 3.8.2) and 
MATLAB R2014a (8.3.0.532) with the -singleCompThread 
flag (see below for more information). All analyses were run 
on the MIT SuperCloud which uses the x86_64 GNU/Linux 
operating system 0. 

The average and standard deviation in run-time across the 
one-hundred tests for each primary network size is shown in 
Table H] We find the MATLAB and Octave runs of PANDA 
have comparable run-times and both are significantly faster 
than the C-code implementation. Even more importantly, the 


fold-improvement in speed in the MATLAB and Octave runs 
increases as the network sizes increase. 


E. Taking Advantage MATLAB’s Built-in Multi-Threading Ca¬ 
pabilities 

In order to more fairly compare results between the com¬ 
piled C-code and the M-code runs of PANDA in both MAT¬ 
LAB and Octave, the results presented in Table |I] ran the 
M-code implementation of PANDA after opening MATLAB 
using the -singleCompThread option. However, by default, 
MATLAB has the ability to take advantage of the multi¬ 
threading capabilities of the computer on which it is installed. 
Therefore, next we determined if an even greater speed-up 
might be obtained by taking advantage of this multi-threading 
capability. To test this, we used the approach described above 
and constructed one single random input network for each of 
a range of sizes. We then evaluated the time it took to perform 
the message-passing procedure using the compiled C-code and 
the M-code when running MATLAB using either the single¬ 
thread or multi-threading (default) option. 

The results of this test are plotted in Figure We hnd 
that by taking advantage of MATLAB’s multi-threading ca¬ 
pabilities we are able to decrease the run-time of the M- 
code significantly, and that this improvement continues to 
increase as the size of the network grows. In fact, for net¬ 
works with several thousand nodes, the single-thread M-code 
improvement compared to the C-code is about 30-fold, but the 
multi-thread M-code run-time improvement compared to the 
C-code is over 100-fold. Although the magnitude of this speed¬ 
up will be dependent on hardware of the computer system 
on which MATLAB is installed, the real-life implications 
are profound. A network that would previously have taken 
several days to reconstruct using the C-code implementation 
of PANDA could be run in a matter of hours on a computer 
system with multi-threading capabilities. So far our tests of 
the message-passing procedure have not considered networks 
that take more than a few hours to reconstruct, even using the 
C-code. As we explain below, large networks necessitating a 
very long computation time are often encountered in the field 
of computational biology. 


II 

C-code 

Octave 

MATLAB 

125 

1.05 ±0.01 

0.50 ±0.03 

0.13 ±0.04 

250 

4.65 ±0.03 

0.62 ±0.01 

0.54 ±0.03 

500 

32.73 ±0.07 

2.82 ±0.03 

2.61 ± 0.002 

1000 

241.51 ± 0.12 

15.82 ± 0.09 

15.06 ±0.04 

2000 

1954.5 ±33.12 

98.59 ±0.89 

95.04 ±0.25 


TABLE I 

A COMPARISON OF THE AVERAGE RUN-TIME, IN SECONDS, OF THE 
C-coDE AND M-code implementations oe the PANDA algorithm 
ACROSS too RANDOM INPUT REGULATORY NETWORK MODELS. THE 
M-code was run using both Octave and MATLAB. 




Fig. 3. Comparison of times when running the M-code implementation 
of PANDA from within MATLAB using either the default multi-threading 
capabilities, or using only a single computational thread. 


F. Real-world Implications for Faster Network Reconstruction 

In biological systems, networks are often on the order of 
thousands, rather than hundreds of nodes. Even the genomes of 
“smaller” organisms contain thousands of genes (approximate 
six thousand for yeast and thirteen thousand for fruit fly). 
Current research on the human genome puts the number 
of protein-coding genes at approximately twenty to twenty- 
hve thousand. Fortunately, the gene regulatory networks for 
biological systems are often highly a-symmetric with respect 
to their “effector” and “affected” nodes, with the number of 
transcription factor regulators (“effector” nodes) typically only 
5-10% that of genes. With these real-world values in mind, we 
tested the speed of our C-code and M-code implementations 
of PANDA across a set of “realistic” network sizes. Here we 
again ran MATLAB with the -singleCompThread option for a 
more fair comparison with the C-code. The results are shown 
in Figure 

Using the initial C-code implementation of PANDA, a 
“human-sized” network reconstruction took over a day. How¬ 
ever, using the new M-code implementation of PANDA this 
same analysis was accomplished in approximately one hour. 
It is also important to emphasize that if we were to take 
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100 
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59.1 

9.8 

200 

4000 

458.6 
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122.5 

400 

8000 

3758.3 

253.8 
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10000 

7298.8 

441.8 

600 
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1019.3 
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97043.8 

3401.3 
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Fig. 4. An evaluate of the time PANDA needs to reconstruct networks that are of similar size to those found in real biological systems. Results from the 
tests in seconds ai‘e included in the table. 


advantage of MATLAB’s default multi-threading capabilities 
this time would be reduced even further. A “real-world” 
network that might take approximately twenty-four hours to 
compute using the C-code could possibly be computed in 
the matter of minutes using the M-code. Overall this has 
significant implications both for biomedical research as well as 
for our ability to provide real-time results as we are beginning 
to integrate network analysis into clinical applications. 

III. Conclusion 

The original implementation of the PANDA algorithm was 
coded in C-H- and determined shared information between 
networks based on a similarity equation implemented through 
a series of nested for-loops. Here we showed how to re-cast 
that equation as a series of matrix operations in order to re¬ 
code PANDA in the MATLAB/Octave programming language. 
We find that this “M-code” implementation not only greatly 
increases code readability compared to the “C-code”, it also 
drastically decreases the time needed to compute the networks. 
Importantly, the M-code implementation also has the potential 
for even further speed-up by taking advantage of MATLAB’s 
built-in multi-threading capabilities, as explored here, or by 
using parallel MATLAB 13 or MATLAB’s GPU support, 
future directions for our group. 

It is important to point out that a similar or even greater 
speed-up of the PANDA algorithm should be be obtainable 
using a lower-level language, including C/C-H-. The reason 
the M-code is faster than the current C-H- implementation 


is because we wrote the M-code in terms of matrix op¬ 
erations in order to take advantage of MATLAB’s built in 
support for the BLAS routines which call the underlying 
SIMD units in the processor. We could have similarity re¬ 
implemented PANDA in terms of matrix operations in C-H-, 
but that would have required redoing all the data structures; 
it was far easier to code the algorithm in MATLAB to take 
advantage of these hardware features. In addition, we suggest 
that the readability we obtained using the MATLAB/Octave 
programming language will have a profoundly positive effect 
on our research abilities. In biological applications we often 
encounter a wide diversity of data-sets that require thoughtful 
custom integration in order to leverage them effectively and 
extract biologically-meaningful results. Future additions and 
modifications of PANDA considering this information will be 
far easier using the MATLAB/Octave programming language 
compared to C-H-. 

In summary, we suggest that the enhanced readability 
coupled with the significant decrease in time-costs associated 
with the M-code implementation of PANDA demonstrates a 
powerful use of the MATLAB/Octave programming language 
in enabling biomedical research. 
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